DATA-608 Final Report

Devin Norris, Joel Penner, Sevgi Sarac Yilmaz, Shiyao Zhang

01/04/2021

INTRODUCTION

Historical Temperature Change

_IMDb (an acronym for Internet Movie Database)¹ is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. An additional fan feature, message boards, was abandoned in February 2017. Originally a fan-operated website, the database is now owned and operated by IMDb.com, Inc., a subsidiary of Amazon._

_As of December 2020, IMDb has approximately 7.5 million titles (including episodes) and 10.4 million personalities in its database,² as well as 83 million registered users._

IMDb began as a movie database on the Usenet group "rec.arts.movies" in 1990 and moved to the web in 1993. source

We intended to use a dataset consisting of IMDb information to create a machine learning model which takes a movie’s features as input and predicts which potentially ratings. This tool would be useful for measuring the popularity of new films based on their features. Based on this aim, we determined our guiding questions and these are given below:

  • Which features can be effective in the prediction of ratings?
  • How is the relation between ratings and other features?
  • How is the behaviour of top and bottom movies' features?
In [156]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Out[156]:
The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.
In [28]:
# Import the necessaries libraries
#data wrangling
import pandas as pd
import numpy as np
import datetime as dt
import glob
import random

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

import plotly.offline as pyo
import plotly.graph_objs as go
#visualazation libraries
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objs as go
# Set notebook mode to work in offline
pyo.init_notebook_mode()

#ignore warnings
import warnings
warnings.filterwarnings("ignore")


Gathering Data

IMDb Datasets
Subsets of IMDb data are available for access to customers for personal and non-commercial use. You can hold local copies of this data, and it is subject to our terms and conditions. Please refer to the Non-Commercial Licensing and copyright/license and verify compliance.

Data Location

The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

IMDb Dataset Details

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

title.akas.tsv.gz - Contains the following information for titles:⬇︎ - titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title
title.basics.tsv.gz - Contains the following information for titles:⬇︎ - tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title
title.crew.tsv.gz – Contains the director and writer information for all the titles in IMDb. Fields include:⬇︎ - tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title
title.episode.tsv.gz – Contains the tv episode information. Fields include:⬇︎ - tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series
title.principals.tsv.gz – Contains the principal cast/crew for titles:⬇︎ - tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'
title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles:⬇︎ - tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received
name.basics.tsv.gz – Contains the following information for names:⬇︎ - nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string)– name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings)– the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for

Loading files:

We converted all files that we will use for our analysis to parquet format. For this process, we chose to thread the pool mapping process for time-saving.

In [2]:
def convert_csv_to_parquet( src ):
    
    df = pd.read_csv(src,sep="\t",low_memory=False, na_values=["\\N","nan"])
    df.to_parquet(src.split(".tsv.gz")[0]+".parquet", compression='brotli')
In [3]:
%%time
import multiprocessing
from multiprocessing.pool import ThreadPool
import glob

files = glob.glob('*.tsv.gz')
pool = ThreadPool(processes=multiprocessing.cpu_count())
pool.map(convert_csv_to_parquet, files)
Wall time: 3min 7s
Out[3]:
[None, None, None, None, None, None, None]
In [29]:
df_akas = pd.read_parquet("title.akas.parquet")
df_basics = pd.read_parquet("title.basics.parquet")
df_ratings = pd.read_parquet("title.ratings.parquet")
df_principals = pd.read_parquet("title.principals.parquet")
df_crew = pd.read_parquet("title.crew.parquet")
print("Akas Table:")
display(df_akas.sample())
print("Basics Table:")
display(df_basics.sample())
print("Ratings Table:")
display(df_ratings.sample())
print("Principals Table:")
display(df_principals.sample())
print("Crew Table:")
display(df_crew.sample())
Akas Table:
titleId ordering title region language types attributes isOriginalTitle
11928679 tt1458605 3 The Queen Bee HK en None None 0.0
Basics Table:
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
3463660 tt1525954 tvEpisode Episode #1.40 Episode #1.40 0.0 1988.0 NaN None Game-Show
Ratings Table:
tconst averageRating numVotes
69804 tt0096721 6.8 9
Principals Table:
tconst ordering nconst category job characters
17185158 tt13123366 2 nm9324066 actress None ["Sam(antha)"]
Crew Table:
tconst directors writers
4881930 tt3809488 nm1313178 None


Data Wrangling & Cleaning

Our cleaning and wrangling activities included:

We changed the 'titleId' columns to 'tconst' to get the same key column names.

  1. Merging:
    We merged dk_akas, df_basics and df_ratings respectively with using 'tconst' and we used outer join to not lose any data. We eliminated df_crew and df_principlas because they are not needed for our analysis.

  2. Limiting:
    We want to predict the ratings of movies, so we only chose movies. Additionally, because there is limited information about movies that filmed before 1950, we preferred to select after movies that are filmed after this date.

  3. Adding Latitude and Longitude

  4. Missing Values & Duplication
  5. Cleaning Genres Column
  6. Adding Top Actors and Directors Table
In [30]:
df_akas.rename(columns = {'titleId':'tconst'},inplace = True)

# merging
df = pd.merge(df_akas,df_basics,on='tconst',how='outer')
df = pd.merge(df,df_ratings,on='tconst',how='outer')
df.head()
Out[30]:
tconst ordering title region language types attributes isOriginalTitle titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
0 tt0000001 1.0 Карменсіта UA None imdbDisplay None 0.0 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short 5.7 1688.0
1 tt0000001 2.0 Carmencita DE None None literal title 0.0 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short 5.7 1688.0
2 tt0000001 3.0 Carmencita - spanyol tánc HU None imdbDisplay None 0.0 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short 5.7 1688.0
3 tt0000001 4.0 Καρμενσίτα GR None imdbDisplay None 0.0 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short 5.7 1688.0
4 tt0000001 5.0 Карменсита RU None imdbDisplay None 0.0 short Carmencita Carmencita 0.0 1894.0 NaN 1 Documentary,Short 5.7 1688.0
In [31]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27712889 entries, 0 to 27712888
Data columns (total 18 columns):
 #   Column           Dtype  
---  ------           -----  
 0   tconst           object 
 1   ordering         float64
 2   title            object 
 3   region           object 
 4   language         object 
 5   types            object 
 6   attributes       object 
 7   isOriginalTitle  float64
 8   titleType        object 
 9   primaryTitle     object 
 10  originalTitle    object 
 11  isAdult          float64
 12  startYear        float64
 13  endYear          float64
 14  runtimeMinutes   object 
 15  genres           object 
 16  averageRating    float64
 17  numVotes         float64
dtypes: float64(7), object(11)
memory usage: 3.9+ GB
In [32]:
df.describe(include=[np.number])
Out[32]:
ordering isOriginalTitle isAdult startYear endYear averageRating numVotes
count 2.545082e+07 2.544863e+07 2.770766e+07 2.342723e+07 201329.000000 3.398108e+06 3.398108e+06
mean 4.043114e+00 2.125026e-02 1.195550e-02 2.002023e+03 2004.118031 6.639795e+00 1.106034e+04
std 3.366887e+00 1.442175e-01 1.151953e+00 1.914265e+01 15.900392 1.353763e+00 6.466183e+04
min 1.000000e+00 0.000000e+00 0.000000e+00 1.874000e+03 1932.000000 1.000000e+00 5.000000e+00
25% 2.000000e+00 0.000000e+00 0.000000e+00 1.996000e+03 1996.000000 5.900000e+00 1.300000e+01
50% 4.000000e+00 0.000000e+00 0.000000e+00 2.009000e+03 2010.000000 6.700000e+00 7.800000e+01
75% 6.000000e+00 0.000000e+00 0.000000e+00 2.015000e+03 2017.000000 7.600000e+00 9.540000e+02
max 1.580000e+02 1.000000e+00 2.020000e+03 2.115000e+03 2030.000000 1.000000e+01 2.358614e+06
In [33]:
df.describe(include=[object])
Out[33]:
tconst title region language types attributes titleType primaryTitle originalTitle runtimeMinutes genres
count 27712889 25450817 24841622 21128532 1989842 217125 27707664 27707656 27707656 5288603 24878609
unique 7689234 3645764 245 104 24 188 13 3721955 3740024 829 2273
top tt0168366 Episodio #1.1 FR ja imdbDisplay transliterated title tvEpisode Episode #1.1 Episode #1.1 60 Drama
freq 158 76964 3140838 3020314 1245832 25824 23264755 269394 269394 276554 3955493
In [9]:
df.titleType.value_counts()
Out[9]:
tvEpisode       23078427
movie            2143925
short            1028703
video             418521
tvSeries          416418
tvMovie           271202
tvMiniSeries       70203
videoGame          48434
tvSpecial          43309
tvShort            13727
radioSeries            1
audiobook              1
episode                1
Name: titleType, dtype: int64
In [32]:
# limiting
df_movie = df.loc[df.titleType == 'movie']
df_movie = df_movie.loc[df.startYear > 1950]
df_movie = df_movie[['tconst','title', 'region', 'language', 'types','primaryTitle', 'originalTitle', 'startYear', 'runtimeMinutes', 'genres', 'averageRating', 'numVotes']]
df_movie.sample(3)
Out[32]:
tconst title region language types primaryTitle originalTitle startYear runtimeMinutes genres averageRating numVotes
22803237 tt8420422 Humihazushita haru: Aoi chibusa II JP None None Young Breasts II Humihazushita haru: Aoi chibusa II 1959.0 None Drama NaN NaN
10166634 tt1337621 Gratuitous Violence US None None Gratuitous Violence Gratuitous Violence 2007.0 None Drama NaN NaN
4613038 tt10758478 Monzetsu futamata: Nagare deruaieki None None original Monzetsu futamata: Nagare deruaieki Monzetsu futamata: Nagare deruaieki 2005.0 65 None NaN NaN
In [35]:
df_movie.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1751329 entries, 28837 to 27712073
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   tconst          object 
 1   title           object 
 2   region          object 
 3   language        object 
 4   types           object 
 5   primaryTitle    object 
 6   originalTitle   object 
 7   startYear       float64
 8   runtimeMinutes  object 
 9   genres          object 
 10  averageRating   float64
 11  numVotes        float64
dtypes: float64(3), object(9)
memory usage: 173.7+ MB


Adding Latitude and Longitude

Because we have lots of countries, it will be hard to convert all of them as dummies so we decided to use coordinations of them instead of converting countries as dummies. So, we added latitude and longitude informations from an open use csv file.

In [33]:
df_loc= pd.read_csv("https://raw.githubusercontent.com/cristiroma/countries/master/data/csv/countries.csv",header=None)
df_loc.rename(columns = {0:'countries', 2:'region',4:'Latitude',5:'Longitude'},inplace = True)
df_loc.drop([1,3,6],axis=1,inplace=True)
df_loc.to_parquet(".parquet", compression='brotli')
In [34]:
df_movie = pd.merge(df_movie,df_loc,on='region',how='left')


Missing Values & Duplication

  1. Firstly, we found duplicated rows and removed them. After removing all duplicated rows, we realized that some movies are given repeatedly the same information but under different countries. Then we also realized when we limited our types to 'working', it gave us the right information about the movie as you can see below, so we limited our data set 'types' to 'working'.

Our target value is 'averageRating' so we should clean all missing values in this column and the others that we use as predictory features in our analysis. So when we examine each missing value:

  1. we should remove all missing values in the average rating column because filling them can miss leading our analysis.
  2. Location also can be an important feature, we have region, language, Countries, latitude and longitude are related to location. Because filling them impossible, we decided to use one of them that has the lowest missing value and not use the other column to minimize information loss.
  3. 'genres' also can be an important feature for our analysis, so we also dropped all missing values in 'genres'.
  4. title has a missing value, but we examine the dataset, we realized that title is the same as originalTitle so we decided to not use the title column.
  5. When we draw a boxplot of 'runtimeMinutes', we realized that it has an extremely high outlier. So, we decided to limit runtimes to 40000 minutes.
In [115]:
#1- find&remove duplications
sum(df_movie.duplicated(subset=['title','region','language','types','primaryTitle','originalTitle','startYear','runtimeMinutes','genres','averageRating','numVotes','countries']))
Out[115]:
242
In [35]:
df_movie = df_movie.drop_duplicates(subset=['title','region','language','types','primaryTitle','originalTitle','startYear','runtimeMinutes','genres','averageRating','numVotes','countries'])
In [117]:
df_movie.loc[(df_movie['originalTitle'] == 'The Shawshank Redemption') ]
Out[117]:
tconst title region language types primaryTitle originalTitle startYear runtimeMinutes genres averageRating numVotes countries Latitude Longitude
487166 tt0111161 Rita Hayworth and Shawshank Redemption US None working The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 United States of America 37.668954 -102.392565
487167 tt0111161 Rastegari dar Shawshank IR fa imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Iran (Islamic Republic of) 31.402403 51.282048
487168 tt0111161 Les Évadés FR None imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 France 46.483721 2.609263
487169 tt0111161 Avain pakoon FI None alternative The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Finland 64.696109 26.363391
487170 tt0111161 Os Condenados de Shawshank PT None imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Portugal 39.448791 -8.037680
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
487226 tt0111161 Gaqceva shoushenkidan GE None imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Georgia 41.827543 44.173299
487227 tt0111161 Die Verurteilten AT None imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Austria 47.631255 13.187767
487228 tt0111161 Втеча з Шоушенка UA None imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Ukraine 48.893586 31.105169
487229 tt0111161 Shoushenkden Qacish AZ None imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Azerbaijan 40.353218 47.467064
487230 tt0111161 The Shawshank Redemption AU None imdbDisplay The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 Australia -26.295946 133.555409

65 rows × 15 columns

In [119]:
df_movie.loc[(df_movie['originalTitle'] == 'The Shawshank Redemption') & (df_movie['types'] == 'working')]
Out[119]:
tconst title region language types primaryTitle originalTitle startYear runtimeMinutes genres averageRating numVotes countries Latitude Longitude
487166 tt0111161 Rita Hayworth and Shawshank Redemption US None working The Shawshank Redemption The Shawshank Redemption 1994.0 142 Drama 9.3 2358614.0 United States of America 37.668954 -102.392565
In [36]:
df_movie = df_movie.loc[df_movie.types == 'working']
In [121]:
#checking missing values
df_movie.isnull().sum()
Out[121]:
tconst                0
title                 0
region             4716
language          20480
types                 0
primaryTitle          0
originalTitle         0
startYear             0
runtimeMinutes     1659
genres              397
averageRating      2492
numVotes           2492
countries          1434
Latitude           1434
Longitude          1434
dtype: int64
In [37]:
#drop missing values
df_movie = df_movie.dropna(subset=['averageRating','Latitude','genres'])
In [38]:
#Distribution of Runtimes of Movies
df_movie.runtimeMinutes = df_movie.runtimeMinutes.astype(float)
ax = sns.boxplot(x=df_movie.runtimeMinutes)
In [39]:
df_movie = df_movie.loc[df_movie.runtimeMinutes < 40000]
In [40]:
ax = sns.boxplot(x=df_movie.runtimeMinutes)
In [41]:
#drop unwanted columns
df_movie = df_movie.drop(['title','region','language','types'],axis=1)
In [73]:
df_movie.isnull().sum()
Out[73]:
tconst            0
primaryTitle      0
originalTitle     0
startYear         0
runtimeMinutes    0
genres            0
averageRating     0
numVotes          0
countries         0
Latitude          0
Longitude         0
dtype: int64


Cleaning Genres Column

Because we have lots of genres, we wanted to simplify them for our analysis. So, firstly we looked at the top 25 genres distribution, then we selected the common seven of them. We used 'Other' for the rest of these genres.

In [127]:
df_movie.genres.value_counts()[:25]
Out[127]:
Drama                         2088
Comedy                        1337
Documentary                    770
Comedy,Drama                   701
Comedy,Romance                 470
Drama,Romance                  470
Comedy,Drama,Romance           456
Horror                         455
Action,Crime,Drama             305
Horror,Thriller                294
Drama,Thriller                 291
Thriller                       282
Crime,Drama                    237
Crime,Drama,Thriller           236
Western                        216
Action,Adventure,Comedy        207
Adventure,Animation,Comedy     172
Crime,Drama,Film-Noir          143
Horror,Mystery,Thriller        143
Action,Drama                   141
Action,Crime,Thriller          140
Action,Adventure,Sci-Fi        139
Action,Thriller                137
Comedy,Crime                   127
Action,Adventure,Fantasy       124
Name: genres, dtype: int64
In [128]:
def genre_finder(x):
    df_genres = set(x.split(','))
    extract_words =  genres_set.intersection(df_genres)
    return ','.join(extract_words)

genres_set = {'Drama', 'Comedy', 'Documentary', 'Horror', 'Thriller','Action','Western'}

df_movie['new_genres'] = df_movie.genres.apply(genre_finder)
name = df_movie['new_genres']
df_movie['new_genres'] = [i.split(",")[0].strip() for i in name]
df_movie['new_genres'] = df_movie.new_genres.replace('','Other')
In [129]:
df_movie.new_genres.value_counts()
Out[129]:
Drama          8715
Comedy         3030
Thriller       1670
Action         1546
Documentary    1328
Horror         1021
Other           771
Western         388
Name: new_genres, dtype: int64


Adding Top Actors and Directors Table

One of IMDB's greatest features is its exhaustive catalog of cast and crew information. How to include this information in a regression model proved to be one of the bigger challenges in this project. Our solution to capture this valuable information was to calculate a count of how many 'big name' actors star in a particular title. In order to complete this step, we wrote a web scraper to collect the IMDB identifiers 'nconst' from this list of the top 1000 actors. Of course this is a subjective evaluation of the best actors/actresses, but the list seems to reasonably cover the 'big name' actors of Hollywood from the past century of movies. The analogous procedure was also conducted for directors from this list of top directors.

In [29]:
df_actors = df_principals.loc[(df_principals.category == 'actor') | (df_principals.category == 'actress')]
df_actors = df_actors[['tconst', 'nconst']]
In [30]:
df_actors = pd.DataFrame(df_actors.groupby('tconst')['nconst'].apply(lambda x: ','.join(x)))
In [31]:
df_actors.sample()
Out[31]:
nconst
tconst
tt4032752 nm1257423
In [32]:
main_df = pd.merge(df_actors,df_crew,on='tconst',how='outer')
main_df = main_df.rename(columns={'nconst':'actors'})
main_df = main_df.drop('writers',axis=1)
main_df.sample(3)
Out[32]:
tconst actors directors
7474094 tt8928198 NaN None
5088712 tt12429540 NaN nm3514859
6551033 tt4509132 NaN None
In [ ]:
# # The following code block has been commented out because it ran as a separate script and 
# #its result has been saved into a separate file. Code is included for posterity.

# from bs4 import BeautifulSoup
# import requests

# # Top actors list
# urls = [
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=1",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=2",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=3",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=4",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=5",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=6",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=7",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=8",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=9",
#     "https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page=10"
# ]

# pages = [requests.get(url) for url in urls]
# soups = [BeautifulSoup(page.content, 'html.parser') for page in pages]

# actors = {}
# for soup in soups:
#     for item in soup.find_all('h3', class_='lister-item-header'):
#         actor = item.a.text.lstrip(' ').rstrip('\n')
#         nmconst = item.a['href'].split('/name/')[1]
#         actors[actor] = nmconst

# # Top directors list

# urls = ["https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=1",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=2",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=3",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=4",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=5",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=6",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=7",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=8",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=9",
#         "https://www.imdb.com/list/ls066140407/?sort=list_order,asc&mode=detail&page=10",  
#     ]

# pages = [requests.get(url) for url in urls]
# soups = [BeautifulSoup(page.content, 'html.parser') for page in pages]

# directors = {}
# for soup in soups:
#     for item in soup.find_all('h3', class_='lister-item-header'):
#         director = item.a.text.lstrip(' ').rstrip('\n')
#         nmconst = item.a['href'].split('/name/')[1]
#         directors[director] = nmconst

# actors_df = pd.DataFrame.from_dict(actors, orient = 'index', columns = ['nmconst'])
# directors_df = pd.DataFrame.from_dict(directors, orient = 'index', columns = ['nmconst'])
# actors_df.head()        

# actors_df.to_csv('top1000actors.csv')
# directors_df.to_csv('top1000directors.csv')  
In [27]:
# def count_top_actors(actorlist):
#     # Function that takes list of actor nconst's and returns a count of their occurence in top 1000 list

#     actor_count = 0
#     for actor in actorlist:
#         #if actor in topactors['nconst']:
#         if topactors['nconst'].str.contains(actor).any():
#             actor_count += 1
    
#     return actor_count

# def top_director(director):
#     if topdirectors['nconst'].str.contains(director).any():
#         return 1
#     else:
#         return 0

# def custom_map(data_split):
#     data_split['topactors'] = data_split['actorlist'].map(lambda x : count_top_actors(x), na_action = 'ignore')
#     data_split['topdirector'] = data_split['directors'].map(lambda x: top_director(x), na_action = 'ignore')
#     return data_split

# cores = 14
# partitions = cores

# def parallelize(data, func):
#     data_split = np.array_split(data, partitions)
#     pool = Pool(cores)
#     data = pd.concat(pool.map(func, data_split))
#     pool.close()
#     pool.join()
#     return data
In [36]:
# main_df = pd.read_csv('titles and actors.csv')
# main_df.head()

# topactors = pd.read_csv('top1000actors.csv')
# topdirectors = pd.read_csv('top1000directors.csv')
# topactors.head()

# main_df.drop(columns = 'Unnamed: 0', inplace = True)
# main_df.head()

# main_df['actorlist'] = main_df['actors'].map(lambda x: x.split(','), na_action = 'ignore')
# main_df.head()

# import ipynb
# %%time
# actorcounts = parallelize(main_df, custom_map)
# actorcounts.head()

# plt.hist(actorcounts.topactors, bins =10)
# plt.show()

# actorcounts.to_csv('actorcounts.csv')
In [1]:
# Converting top 1000 actors and top 1000 directors as a parquet file
def convert_csv_to_parquet( src ):
    
    df = pd.read_csv(src)
    df.to_parquet(src.split(".csv")[0]+".parquet", compression='brotli')
In [2]:
%%time
import multiprocessing
from multiprocessing.pool import ThreadPool
import glob

files = glob.glob('*.csv')
pool = ThreadPool(processes=multiprocessing.cpu_count())
pool.map(convert_csv_to_parquet, files)
CPU times: user 10.5 ms, sys: 10.8 ms, total: 21.3 ms
Wall time: 37 ms
Out[2]:
[]
In [38]:
topactors = pd.read_parquet('top1000actors.parquet')
topdirectors = pd.read_parquet('top1000directors.parquet')
print("Top 1000 Actors:")
display(topactors.sample())
print("Top 1000 Directors:")
display(topdirectors.sample())
Top 1000 Actors:
Unnamed: 0 nconst
714 Bruno Ganz nm0004486
Top 1000 Directors:
Unnamed: 0 nconst
373 Gary Shore nm2411495
In [ ]:
#actorcounts.to_csv('actorcounts.csv')
actorcounts.to_parquet('actorcounts.parquet')
In [24]:
actorcounts = pd.read_parquet('actorcounts.parquet')
display(actorcounts.sample(3))
Unnamed: 0 tconst actors directors actorlist topactors topdirector
6157917 6157917 tt2743526 None None None NaN NaN
6876469 6876469 tt5943138 None None None NaN NaN
117003 117003 tt0133325 nm0939454,nm0262647,nm0548010,nm0420011 nm0952761 ['nm0939454', 'nm0262647', 'nm0548010', 'nm042... 0.0 0.0
In [130]:
df_movie = pd.merge(df_movie,actorcounts,on='tconst',how='left')
In [131]:
df_movie.drop(['Unnamed: 0','actors','directors','actorlist'],axis=1,inplace=True)
df_movie.sample()
Out[131]:
tconst primaryTitle originalTitle startYear runtimeMinutes genres averageRating numVotes countries Latitude Longitude new_genres topactors topdirector
15941 tt4469850 Arlo: The Burping Pig Arlo: The Burping Pig 2016.0 80.0 Family 4.3 102.0 Namibia -22.709656 16.721619 Other 0.0 0.0
In [132]:
df_movie = df_movie.fillna(0)
df_movie.isnull().sum()
Out[132]:
tconst            0
primaryTitle      0
originalTitle     0
startYear         0
runtimeMinutes    0
genres            0
averageRating     0
numVotes          0
countries         0
Latitude          0
Longitude         0
new_genres        0
topactors         0
topdirector       0
dtype: int64
In [80]:
df_movie.topactors.value_counts()
Out[80]:
0.0    13393
1.0     2587
2.0     1324
3.0      821
4.0      344
Name: topactors, dtype: int64
In [81]:
df_movie.topdirector.value_counts()
Out[81]:
0.0    15884
1.0     2585
Name: topdirector, dtype: int64


Analysis

Exploratory Data Analysis

We have around 18,500 movies in a data frame, before moving the machine learning part, we wanted to understand the behaviour of each feature. So, firstly we looked at features relation among them and distributions of each of them. Our findings are visualized in the below graphs.

Distribution of Features

To understand each relation between target and predictor features, we categorized our target variable with divided 10 bins. When we looked at the average rating row, there is a clear relationship between the number of votes and average ratings. Therefore, we considered the number of votes as an important feature in our analysis.

In [51]:
nbins = 10
import pandas as pd
df_movie["RatingCat"] = pd.qcut(df_movie["averageRating"], q=nbins, labels=False)

# plot the pairwise scatterplot
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
print("Pairwise Scatterplots of Features")
sns.set(style="ticks")
sns.pairplot(df_movie, hue="RatingCat", palette="RdBu_r")
plt.show()
Pairwise Scatterplots of Features

Histograms

As we can expect, when the year passes, the number of movies increases. Just 2020 has a decrease, the possible reason for it can be pandemic. Average Ratings has a quite normal distribution, while runtimes look normal, but it still has some outliers. When we examine a number of votes, it has lots of values near zero, which may cause a problem in our analysis.

In [52]:
df_movie = df_movie.drop(['RatingCat'],axis=1)
In [53]:
import matplotlib
fig = plt.figure(constrained_layout=True, figsize=(20,10))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_movie[['startYear', 'runtimeMinutes', 'averageRating', 'numVotes']]
print("Histograms")   
for i, column in enumerate(df_sns.columns):
    if df_sns[column].dtype.kind not in 'bifc': continue
    ax = fig.add_subplot( gs[i//2,i%2])
    sns.distplot(df_sns[column],ax=ax).set_title(column)
Histograms

Frequencies

We already examined the start year of the movies above. When we looked at genre distribution, the majority of movies are drama, followed by comedy and action.

In [54]:
fig = plt.figure(constrained_layout=True, figsize=(20,20))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_movie[['startYear', 'new_genres','topactors','topdirector']]
print("Frequencies")   
for i, column in enumerate(df_sns.columns):
    ax = fig.add_subplot( gs[i//2,i%2])
    sns.countplot(df_sns[column],ax=ax).set_title(column)
    ax.tick_params(labelrotation=90)
Frequencies

Detailed Analysis of Top-Bottom Movies according to average ratings and number of votes

For a better understanding of the behaviours of the features, we thought that examing the top rated and bottom rated movies can be useful. For that reason, we limited our data whose number of votes is higher than 100,000. Then, we chose the top hundred and the bottom hundred movies.

In [55]:
df_movie = df_movie.sort_values("averageRating",ascending=False)
df_t = df_movie.loc[df_movie.numVotes >= 100000]
df_top = df_t.head(100)
df_bottom = df_t.tail(100)
print("Top Ten")
display(df_top.head(10))
print("Bottom Ten")
display(df_bottom.head(10))
Top Ten
primaryTitle originalTitle startYear runtimeMinutes genres averageRating numVotes countries Latitude Longitude new_genres topactors topdirector
4720 The Shawshank Redemption The Shawshank Redemption 1994.0 142.0 Drama 9.3 2354543.0 United States of America 37.668954 -102.392565 Drama 2.0 1.0
9002 The Dark Knight The Dark Knight 2008.0 152.0 Action,Crime,Drama 9.0 2314083.0 United States of America 37.668954 -102.392565 Drama 4.0 1.0
9003 The Dark Knight The Dark Knight 2008.0 152.0 Action,Crime,Drama 9.0 2314083.0 United States of America 37.668954 -102.392565 Drama 4.0 1.0
2570 The Godfather: Part II The Godfather: Part II 1974.0 202.0 Crime,Drama 9.0 1135657.0 United States of America 37.668954 -102.392565 Drama 4.0 1.0
2569 The Godfather: Part II The Godfather: Part II 1974.0 202.0 Crime,Drama 9.0 1135657.0 United States of America 37.668954 -102.392565 Drama 4.0 1.0
4710 Pulp Fiction Pulp Fiction 1994.0 154.0 Crime,Drama 8.9 1835129.0 United States of America 37.668954 -102.392565 Drama 4.0 1.0
1744 The Good, the Bad and the Ugly Il buono, il brutto, il cattivo 1966.0 161.0 Western 8.8 691389.0 Italy 41.778108 12.677251 Western 1.0 1.0
4274 Goodfellas Goodfellas 1990.0 146.0 Biography,Crime,Drama 8.7 1026467.0 United States of America 37.668954 -102.392565 Drama 3.0 1.0
4821 Se7en Se7en 1995.0 127.0 Crime,Drama,Mystery 8.6 1453135.0 United States of America 37.668954 -102.392565 Drama 3.0 1.0
9459 Interstellar Interstellar 2014.0 169.0 Adventure,Drama,Sci-Fi 8.6 1522842.0 United States of America 37.668954 -102.392565 Drama 3.0 1.0
Bottom Ten
primaryTitle originalTitle startYear runtimeMinutes genres averageRating numVotes countries Latitude Longitude new_genres topactors topdirector
10847 Resident Evil: Afterlife Resident Evil: Afterlife 2010.0 96.0 Action,Adventure,Horror 5.8 162302.0 United States of America 37.668954 -102.392565 Action 1.0 1.0
10735 2012 2012 2009.0 158.0 Action,Adventure,Sci-Fi 5.8 354394.0 United States of America 37.668954 -102.392565 Action 4.0 1.0
7046 Mr. Deeds Mr. Deeds 2002.0 96.0 Comedy,Romance 5.8 136679.0 United States of America 37.668954 -102.392565 Comedy 3.0 1.0
11133 Teenage Mutant Ninja Turtles Teenage Mutant Ninja Turtles 2014.0 101.0 Action,Adventure,Comedy 5.8 201355.0 United States of America 37.668954 -102.392565 Action 1.0 0.0
5780 Lara Croft: Tomb Raider Lara Croft: Tomb Raider 2001.0 100.0 Action,Adventure,Fantasy 5.8 198319.0 United States of America 37.668954 -102.392565 Action 2.0 1.0
5779 Lara Croft: Tomb Raider Lara Croft: Tomb Raider 2001.0 100.0 Action,Adventure,Fantasy 5.8 198319.0 United States of America 37.668954 -102.392565 Action 2.0 1.0
5781 Lara Croft: Tomb Raider Lara Croft: Tomb Raider 2001.0 100.0 Action,Adventure,Fantasy 5.8 198319.0 United States of America 37.668954 -102.392565 Action 2.0 1.0
8427 Final Destination 3 Final Destination 3 2006.0 93.0 Action,Crime,Horror 5.8 131239.0 United States of America 37.668954 -102.392565 Action 1.0 0.0
8426 Final Destination 3 Final Destination 3 2006.0 93.0 Action,Crime,Horror 5.8 131239.0 United States of America 37.668954 -102.392565 Action 1.0 0.0
12604 Percy Jackson: Sea of Monsters Percy Jackson: Sea of Monsters 2013.0 106.0 Adventure,Family,Fantasy 5.8 110715.0 United States of America 37.668954 -102.392565 Other 2.0 0.0

Pairwise Scatterplots:

We repeated the above processes using the subsets of our data. Here we used 5 bins of run times as a category. Again, there is a positive relationship between average ratings and the number of votes. Additionally, we can see there is a relationship between run times and average ratings, especially among the bottom movies.

In [56]:
nbins = 5

df_top["RuntimeCat"] = pd.qcut(df_top["runtimeMinutes"], q=nbins, labels=False)
            
# plot the pairwise scatterplot

print("Pairwise Scatterplots of Top Hundred Movies")
sns.set(style="ticks")
sns.pairplot(df_top, hue="RuntimeCat", palette="RdBu_r",vars=["startYear", "runtimeMinutes", "averageRating","numVotes"])
plt.show()

df_bottom["RuntimeCat"] = pd.qcut(df_bottom["runtimeMinutes"], q=nbins, labels=False)
print("Pairwise Scatterplots of Bottom Hundred Movies")
sns.set(style="ticks")
sns.pairplot(df_bottom, hue="RuntimeCat", palette="RdBu_r",vars=["startYear", "runtimeMinutes", "averageRating","numVotes"])
plt.show()
Pairwise Scatterplots of Top Hundred Movies
Pairwise Scatterplots of Bottom Hundred Movies

Histograms & Frequencies

As you can see in the average ratings graph, light orange shows bottom movies, while light blue shows top movies. we can see that newly-released movies generally bottom of the list and run times of the bottom list also lower than others. And while the number of votes of top movies has a nearly normal distribution, the bottom movies have right-skewed distribution. There is no quite clear pattern for start year of movies for top and bottom list. And, while drama is the first place in the top list, interestingly action is the first for the bottom list.

In [57]:
fig = plt.figure(constrained_layout=True, figsize=(20,10))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_top[['startYear', 'runtimeMinutes', 'averageRating', 'numVotes']]
print("Histograms for top & bottom hundred movies")   
for i, column in enumerate(df_sns.columns):
    if df_sns[column].dtype.kind not in 'bifc': continue
    ax = fig.add_subplot( gs[i//2,i%2])
    sns.distplot(df_sns[column],ax=ax).set_title(column)

df_sns = df_bottom[['startYear', 'runtimeMinutes', 'averageRating', 'numVotes']] 
for i, column in enumerate(df_sns.columns):
    if df_sns[column].dtype.kind not in 'bifc': continue
    ax = fig.add_subplot( gs[i//2,i%2])
    sns.distplot(df_sns[column],ax=ax).set_title(column)
Histograms for top & bottom hundred movies
In [58]:
fig = plt.figure(constrained_layout=True, figsize=(20,20))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_top[['startYear', 'new_genres','topactors','topdirector']]
print("Frequencies")   
for i, column in enumerate(df_sns.columns):
    ax = fig.add_subplot( gs[i//2,i%2])
    sns.countplot(df_sns[column],ax=ax).set_title(column)
    ax.tick_params(labelrotation=90)
Frequencies
In [59]:
fig = plt.figure(constrained_layout=True, figsize=(20,20))
gs = matplotlib.gridspec.GridSpec(ncols=2, nrows=2, figure=fig)
df_sns = df_bottom[['startYear', 'new_genres','topactors','topdirector']]
print("Frequencies")   
for i, column in enumerate(df_sns.columns):
    ax = fig.add_subplot( gs[i//2,i%2])
    sns.countplot(df_sns[column],ax=ax).set_title(column)
    ax.tick_params(labelrotation=90)
Frequencies

Location Distributions of Average Ratings

After examining the subsets of our data, we also wanted to see the location distributions of average ratings for our whole data. From the choropleth map, we can see that the average rating of movies in Kazakhstan was lowest, while the average rating of movies in Azerbaijan was highest. But these extreme ratings may be due to the limited number of movies in these two countries.

In [42]:
df_map = df_movie.groupby(['countries']).agg({'tconst':"count", 'averageRating':"mean"})

df_map.reset_index(inplace = True)
#df_map = df_map.sort_values(by=['startYear'])
#df_map.startYear = df_map.startYear.astype(int)
In [43]:
fig = px.choropleth(df_map, locations="countries", # used plotly express choropleth for animation plot
                    color="averageRating", 
                    locationmode='country names',
                    hover_name="countries",
                    hover_data=['tconst'],
                    #animation_frame =df_map.startYear,
                    title = 'Location Distributions of Average Ratings  1951 - 2021')

# adjusting size of map, legend place, and background colour
fig.update_layout(
    autosize=False,
    width=1000,
    height=500,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    template='seaborn',
    #paper_bgcolor="rgb(234, 234, 242)",
    legend=dict(
        orientation="v",
        yanchor="auto",
        y=1.02,
        xanchor="right",
        x=1
))

fig.show()
# reference: https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth.html

Correlation Matrix

It is clearly seen that there is no high correlation between target value and the other features. Additionally, the highest correlation is between top directors and top actors.

In [111]:
plt.figure(figsize=(10,10))
corrMatrix = df_movie.corr()
sns.heatmap(corrMatrix, annot=True,vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)

plt.show()
In [61]:
#to work on Talc
df_movie.to_parquet('df_movie')


Machine Learning

In this section, we worked on Talc. We tried a bunch of Regression Models to acquire the best prediction. We listed each model below, and we also tried different methodologies on each model to increase accuracy. Because we have lots of movies that have nearly zero votes, we realized that these values cause to decrease in our accuracy. After we limited the number of votes to a reasonable number, our predictions became better, so we worked on this data set for the other models that we run. Additionally, to compare each models, we used 0.8 and 0.2 train test split each model and also random seed as 42.
The models:

  • Random Forest Regression
  • Linear Regression
  • Generalized Linear Regression
  • Gradient-boosted tree Regression

Sparkcluser on Talc:

In [1]:
# kill previous sparkcluster jobs just in case
try: sc.stop()
except: pass
try: sj.stop()
except: pass
! scancel -u `whoami` -n sparkcluster
In [2]:
import os
import atexit
import sys
import time

import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
import findspark
from sparkhpc import sparkjob

#Exit handler to clean up the Spark cluster if the script exits or crashes
def exitHandler(sj,sc):
    try:
        print('Trapped Exit cleaning up Spark Context')
        sc.stop()
    except:
        pass
    try:
        print('Trapped Exit cleaning up Spark Job')
        sj.stop()
    except:
        pass

findspark.init()

# Parameters for the Spark cluster
# At present, we have to reserve an entire node at a time (this is due to an update on the system, and this will be
# addressed in the future)

nodes=1
tasks_per_node=24
memory_per_task=10000 

# Please estimate walltime carefully to keep unused Spark clusters from sitting 
# idle so that others may use the shared resources.
walltime="3:00" # hh:mm, half an hour
os.environ['SBATCH_PARTITION']='cpu24' #Set the appropriate TALC partition

sj = sparkjob.sparkjob(
     ncores=nodes*tasks_per_node,
     cores_per_executor=tasks_per_node,
     memory_per_core=memory_per_task,
     memory_per_executor=memory_per_task-500,
     walltime=walltime
    )

sj.wait_to_start()
time.sleep(60)
sc = sj.start_spark()

#Register the exit handler                                                                                                     
atexit.register(exitHandler,sj,sc)

#You need this line if you want to use SparkSQL
scq=SQLContext(sc)

display(sc)
INFO:sparkhpc.sparkjob:Submitted batch job 3857

INFO:sparkhpc.sparkjob:Submitted cluster 0

SparkContext

Spark UI

Version
v2.4.0
Master
local[*]
AppName
pyspark-shell
In [104]:
#loading parquet file to work on spark
df_movie = scq.read.parquet('df_movie')
df_movie = df_movie.cache()
In [105]:
#getting dummies for genres
from pyspark.sql.functions import when

df_with_extra_columns = df_movie.withColumn("Drama", when(df_movie.new_genres == "Drama", 1)).withColumn("Comedy", when(df_movie.new_genres == "Comedy", 1)).withColumn("Documentary", when(df_movie.new_genres == "Documentary", 1)).withColumn("Action", when(df_movie.new_genres == "Action", 1)).withColumn("Other", when(df_movie.new_genres == "Other", 1)).withColumn("Thriller", when(df_movie.new_genres == "Thriller", 1)).withColumn("Horror", when(df_movie.new_genres == "Horror", 1)).withColumn("Western", when(df_movie.new_genres == "Western", 1))
In [106]:
df_with_extra_columns = df_with_extra_columns.na.fill(value=0)
In [107]:
df_with_extra_columns.toPandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18469 entries, 0 to 18468
Data columns (total 22 columns):
primaryTitle         18469 non-null object
originalTitle        18469 non-null object
startYear            18469 non-null float64
runtimeMinutes       18469 non-null float64
genres               18469 non-null object
averageRating        18469 non-null float64
numVotes             18469 non-null float64
countries            18469 non-null object
Latitude             18469 non-null float64
Longitude            18469 non-null float64
new_genres           18469 non-null object
topactors            18469 non-null float64
topdirector          18469 non-null float64
__index_level_0__    18469 non-null int64
Drama                18469 non-null int32
Comedy               18469 non-null int32
Documentary          18469 non-null int32
Action               18469 non-null int32
Other                18469 non-null int32
Thriller             18469 non-null int32
Horror               18469 non-null int32
Western              18469 non-null int32
dtypes: float64(8), int32(8), int64(1), object(5)
memory usage: 2.5+ MB
In [108]:
df = df_with_extra_columns[[ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","topactors","topdirector","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"]]


Random Forest Regression

In [8]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
In [9]:
features = ["startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western"]
v_asm = VectorAssembler(inputCols=features, outputCol="features")
ml_df1 = v_asm.transform(df_with_extra_columns.select([ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"])).cache()
ml_df1.show(3, truncate=False)
+---------+--------------+--------+-----------+------------+-----+------+-----------+------+-----+--------+------+-------+-------------+------------------------------------------------------------------+
|startYear|runtimeMinutes|numVotes|Latitude   |Longitude   |Drama|Comedy|Documentary|Action|Other|Thriller|Horror|Western|averageRating|features                                                          |
+---------+--------------+--------+-----------+------------+-----+------+-----------+------+-----+--------+------+-------+-------------+------------------------------------------------------------------+
|2009.0   |83.0          |5.0     |37.66895362|-102.3925645|0    |0     |1          |0     |0    |0       |0     |0      |10.0         |(13,[0,1,2,3,4,7],[2009.0,83.0,5.0,37.66895362,-102.3925645,1.0]) |
|2013.0   |55.0          |6.0     |37.66895362|-102.3925645|0    |0     |1          |0     |0    |0       |0     |0      |9.7          |(13,[0,1,2,3,4,7],[2013.0,55.0,6.0,37.66895362,-102.3925645,1.0]) |
|2020.0   |77.0          |31.0    |37.66895362|-102.3925645|0    |1     |0          |0     |0    |0       |0     |0      |9.6          |(13,[0,1,2,3,4,6],[2020.0,77.0,31.0,37.66895362,-102.3925645,1.0])|
+---------+--------------+--------+-----------+------------+-----+------+-----------+------+-----+--------+------+-------+-------------+------------------------------------------------------------------+
only showing top 3 rows

In [10]:
%%time
# code below was modified from https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier



# Load and parse the data file, converting it to a DataFrame.
data = ml_df1

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = data.randomSplit([0.8, 0.2],seed=42)

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures", labelCol='averageRating')

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

predictions0 = model.transform(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
print("Predictions on train data:")
predictions0.select("prediction", "averageRating", "features").show(5)
# display(predictions0.select("prediction", "median_house_value", "features").limit(5).toPandas())

print("Predictions on test data:")
predictions.select("prediction", "averageRating", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="averageRating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
print("R^2 on test data = %g" % r2)

print( model.stages[1] ) # summary only
Predictions on train data:
+-----------------+-------------+--------------------+
|       prediction|averageRating|            features|
+-----------------+-------------+--------------------+
|5.849454027076801|          6.5|(13,[0,1,2,3,4,12...|
| 5.67723990440501|          5.0|(13,[0,1,2,3,4,6]...|
|5.949190957551703|          7.2|(13,[0,1,2,3,4,5]...|
| 5.67723990440501|          6.2|(13,[0,1,2,3,4,6]...|
| 5.67723990440501|          6.2|(13,[0,1,2,3,4,6]...|
+-----------------+-------------+--------------------+
only showing top 5 rows

Predictions on test data:
+-----------------+-------------+--------------------+
|       prediction|averageRating|            features|
+-----------------+-------------+--------------------+
|5.849454027076801|          6.8|(13,[0,1,2,3,4,12...|
|6.055377008097173|          5.7|(13,[0,1,2,3,4,5]...|
|5.815344009675345|          6.1|(13,[0,1,2,3,4,6]...|
|5.961229411957215|          6.2|(13,[0,1,2,3,4,5]...|
|5.748306467795249|          5.8|(13,[0,1,2,3,4,12...|
+-----------------+-------------+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 1.05003
R^2 on test data = 0.304543
RandomForestRegressionModel (uid=RandomForestRegressor_231595945857) with 20 trees
CPU times: user 72.4 ms, sys: 13 ms, total: 85.4 ms
Wall time: 38.6 s

Hyperparameter Tuning

In [11]:
features_cols = df.columns
features_cols.remove('averageRating')
vec= VectorAssembler(inputCols=features_cols, outputCol='features')
df = vec.transform(df)
ml_ready_df = df.select(['averageRating','features'])
ml_ready_df.show(5)
+-------------+--------------------+
|averageRating|            features|
+-------------+--------------------+
|         10.0|(15,[0,1,2,3,4,9]...|
|          9.7|(15,[0,1,2,3,4,9]...|
|          9.6|(15,[0,1,2,3,4,8]...|
|          9.6|(15,[0,1,2,3,4,11...|
|          9.6|(15,[0,1,2,3,4,7]...|
+-------------+--------------------+
only showing top 5 rows

In [12]:
# Train a RandomForest model.
data = ml_ready_df
rf = RandomForestRegressor(featuresCol="features", labelCol='averageRating', numTrees=25, maxDepth=20)
(trainingData, testData) = data.randomSplit([0.8, 0.2],seed=42)
In [14]:
model = rf.fit(trainingData)
predictions = model.transform(testData)
predictions.select("prediction", "averageRating").show(5)
+-----------------+-------------+
|       prediction|averageRating|
+-----------------+-------------+
| 6.49220628172035|          1.1|
|6.311147211542842|          1.2|
|1.768508799547937|          1.3|
|7.278540852624279|          1.6|
|3.497785330996349|          1.6|
+-----------------+-------------+
only showing top 5 rows

In [15]:
evaluator = RegressionEvaluator(labelCol="averageRating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("RMSE: " + str(rmse))
print("R^2: " + str(r2))
RMSE: 0.8998088107129675
R^2: 0.4801991982545629
In [16]:
import pandas as pd
predictions_df = predictions.toPandas()
In [18]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot(predictions_df.averageRating, predictions_df.prediction, 'bo',alpha=.5)
plt.xlabel('averageRating')
plt.ylabel('Prediction')
plt.suptitle("Model Performance RMSE: %f" % rmse)
plt.show()
In [19]:
import pandas as pd
# Convert feature importances to a pandas column
fi_df = pd.DataFrame(model.featureImportances.toArray(),
columns=['importance'])
In [20]:
fi_df['feature'] =pd.Series(features_cols)
fi_df.sort_values(by=['importance'],ascending=False,inplace=True)
In [22]:
fi_df
Out[22]:
importance feature
2 0.240249 numVotes
1 0.164370 runtimeMinutes
0 0.158259 startYear
9 0.112202 Documentary
13 0.093172 Horror
4 0.054539 Longitude
3 0.052880 Latitude
7 0.033746 Drama
5 0.032026 topactors
8 0.015740 Comedy
6 0.014738 topdirector
10 0.010162 Action
12 0.009278 Thriller
11 0.006079 Other
14 0.002559 Western
In [23]:
plt.style.use('ggplot')
plt.bar(fi_df.feature, fi_df.importance, orientation = 'vertical', alpha=.5)
plt.xticks(rotation=90)
plt.ylabel('Importance')
plt.xlabel('Feature')
plt.title('Feature Importances')
plt.show()

trying with higher number of votes:

In [109]:
df9 = df_with_extra_columns.filter(df_with_extra_columns.numVotes >= 100000)
In [110]:
df = df9[[ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","topactors","topdirector","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"]]
In [111]:
features_cols = df.columns
features_cols.remove('averageRating')
vec= VectorAssembler(inputCols=features_cols, outputCol='features')
df = vec.transform(df)
ml_ready_df = df.select(['averageRating','features'])
ml_ready_df.show(5)
+-------------+--------------------+
|averageRating|            features|
+-------------+--------------------+
|          9.3|(15,[0,1,2,3,4,5,...|
|          9.0|(15,[0,1,2,3,4,5,...|
|          9.0|(15,[0,1,2,3,4,5,...|
|          9.0|(15,[0,1,2,3,4,5,...|
|          9.0|(15,[0,1,2,3,4,5,...|
+-------------+--------------------+
only showing top 5 rows

In [135]:
# Train a RandomForest model.
data = ml_ready_df
rf = RandomForestRegressor(featuresCol="features", labelCol='averageRating', numTrees=65, maxDepth=30)
(trainingData, testData) = data.randomSplit([0.8, 0.2],seed=42)
model = rf.fit(trainingData)
predictions = model.transform(testData)
predictions.select("prediction", "averageRating").show(5)

evaluator = RegressionEvaluator(labelCol="averageRating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
print("RMSE: " + str(rmse))
print("R^2: " + str(r2))
+------------------+-------------+
|        prediction|averageRating|
+------------------+-------------+
| 3.526193162393165|          2.4|
|3.0255521367521387|          2.8|
|3.0255521367521387|          2.8|
| 4.874372973431794|          4.7|
|6.0406336799136655|          4.8|
+------------------+-------------+
only showing top 5 rows

RMSE: 0.37554119175566514
R^2: 0.816909555808465
In [138]:
import pandas as pd
predictions_df = predictions.toPandas()
In [139]:
plt.style.use('ggplot')

plt.plot(predictions_df.averageRating, predictions_df.prediction, 'bo', alpha=.5)
plt.xlabel('averageRating')
plt.ylabel('Prediction')
plt.suptitle("Model Performance RMSE: %f" % rmse)
plt.show()
In [140]:
model = rf.fit(trainingData)
predictions = model.transform(ml_ready_df)
predictions.select("prediction", "averageRating").show(5)
+-----------------+-------------+
|       prediction|averageRating|
+-----------------+-------------+
|8.873959753921293|          9.3|
|8.783856532356534|          9.0|
|8.783856532356534|          9.0|
|8.661217349709666|          9.0|
|8.661217349709666|          9.0|
+-----------------+-------------+
only showing top 5 rows

In [141]:
predictions_df = predictions.toPandas()
In [142]:
plt.style.use('ggplot')

plt.plot(predictions_df.averageRating, predictions_df.prediction, 'bo', alpha=.5)
plt.xlabel('averageRating')
plt.ylabel('Prediction')
plt.suptitle("Model Performance on the whole data set")
plt.show()
In [40]:
import pandas as pd
# Convert feature importances to a pandas column
fi_df = pd.DataFrame(model.featureImportances.toArray(),
columns=['importance'])
In [41]:
fi_df['feature'] =pd.Series(features_cols)
fi_df.sort_values(by=['importance'],ascending=False,inplace=True)
In [42]:
fi_df
Out[42]:
importance feature
2 0.322338 numVotes
0 0.234757 startYear
1 0.157531 runtimeMinutes
5 0.069434 topactors
7 0.052413 Drama
10 0.050197 Action
4 0.027155 Longitude
6 0.024814 topdirector
3 0.022620 Latitude
8 0.014798 Comedy
13 0.011075 Horror
12 0.007766 Thriller
11 0.005025 Other
14 0.000076 Western
9 0.000000 Documentary
In [58]:
plt.style.use('ggplot')
plt.bar(fi_df.feature, fi_df.importance, orientation = 'vertical',alpha=.5)
plt.xticks(rotation=90)
plt.ylabel('Importance')
plt.xlabel('Feature')
plt.title('Feature Importances')
plt.show()


Linear Regression

In [31]:
df9 = df_with_extra_columns.filter(df_with_extra_columns.numVotes >= 100000)
In [65]:
df = df9[[ "startYear", "runtimeMinutes", "numVotes", "Latitude","Longitude","topactors","topdirector","Drama","Comedy","Documentary","Action","Other","Thriller","Horror","Western","averageRating"]]
In [136]:
# reference: https://runawayhorse001.github.io/LearningApacheSpark/regression.html

from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

# I provide two ways to build the features and labels

# Method 2 (good for large features):
def transData(data):
    return data.rdd.map(lambda r: [Vectors.dense(r[:-1]),r[-1]]).toDF(['features','label'])
In [137]:
transformed= transData(df)
transformed.show(5)
+--------------------+--------------------+
|            features|               label|
+--------------------+--------------------+
|[1994.0,142.0,235...|(15,[0,1,2,3,4,5,...|
|[1974.0,202.0,113...|(15,[0,1,2,3,4,5,...|
|[1974.0,202.0,113...|(15,[0,1,2,3,4,5,...|
|[2008.0,152.0,231...|(15,[0,1,2,3,4,5,...|
|[2008.0,152.0,231...|(15,[0,1,2,3,4,5,...|
+--------------------+--------------------+
only showing top 5 rows

In [68]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.

featureIndexer = VectorIndexer(inputCol="features", \
                               outputCol="indexedFeatures",\
                               maxCategories=4).fit(transformed)

data = featureIndexer.transform(transformed)
In [69]:
data.show(5,True)
+--------------------+-----+--------------------+
|            features|label|     indexedFeatures|
+--------------------+-----+--------------------+
|[1994.0,142.0,235...|  9.3|[1994.0,142.0,235...|
|[1974.0,202.0,113...|  9.0|[1974.0,202.0,113...|
|[1974.0,202.0,113...|  9.0|[1974.0,202.0,113...|
|[2008.0,152.0,231...|  9.0|[2008.0,152.0,231...|
|[2008.0,152.0,231...|  9.0|[2008.0,152.0,231...|
+--------------------+-----+--------------------+
only showing top 5 rows

In [91]:
#reference: https://spark.apache.org/docs/latest/ml-tuning.html
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare training and test data.
data = data
train, test = data.randomSplit([0.8, 0.2], seed=42)

lr = LinearRegression(maxIter=10)

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.005]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()

# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.5)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)

# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show(5)
+--------------------+-----+-----------------+
|            features|label|       prediction|
+--------------------+-----+-----------------+
|[1954.0,108.0,142...|  8.1| 8.19882180188656|
|[1954.0,108.0,142...|  8.1| 8.19882180188656|
|[1954.0,108.0,142...|  8.1| 8.19882180188656|
|[1959.0,136.0,300...|  8.3|8.237284244654845|
|[1964.0,95.0,4529...|  8.4|8.016662113301628|
+--------------------+-----+-----------------+
only showing top 5 rows

In [92]:
# Make predictions.
predictions = model.transform(test)
In [93]:
# Select example rows to display.
predictions.select("features","label","prediction").show(5)
from pyspark.ml.evaluation import RegressionEvaluator
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label",
                                predictionCol="prediction",
                                metricName="rmse")

rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()

import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))
+--------------------+-----+-----------------+
|            features|label|       prediction|
+--------------------+-----+-----------------+
|[1954.0,108.0,142...|  8.1| 8.19882180188656|
|[1954.0,108.0,142...|  8.1| 8.19882180188656|
|[1954.0,108.0,142...|  8.1| 8.19882180188656|
|[1959.0,136.0,300...|  8.3|8.237284244654845|
|[1964.0,95.0,4529...|  8.4|8.016662113301628|
+--------------------+-----+-----------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.584809
r2_score: 0.5053604999664373


Generalized linear regression

In [58]:
# Import LinearRegression class
from pyspark.ml.regression import GeneralizedLinearRegression

# Define LinearRegression algorithm
glr = GeneralizedLinearRegression(family="gaussian", link="identity",\
                                  maxIter=10, regParam=0.3)
In [59]:
# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, glr])

model = pipeline.fit(trainingData)
In [62]:
# Make predictions.
predictions = model.transform(testData)
In [63]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label",
                                predictionCol="prediction",
                                metricName="rmse")

rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
y_true = predictions.select("label").toPandas()
y_pred = predictions.select("prediction").toPandas()

import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))
Root Mean Squared Error (RMSE) on test data = 0.609033
r2_score: 0.46353519363375806


Gradient-boosted tree regression

In [64]:
# Import LinearRegression class
from pyspark.ml.regression import GBTRegressor

# Define LinearRegression algorithm
rf = GBTRegressor() #numTrees=2, maxDepth=2, seed=42
In [65]:
# Chain indexer and tree in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])
model = pipeline.fit(trainingData)
In [66]:
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("features","label", "prediction").show(5)
+--------------------+-----+-----------------+
|            features|label|       prediction|
+--------------------+-----+-----------------+
|[1954.0,108.0,142...|  8.1|  7.8084094931346|
|[1954.0,108.0,142...|  8.1|  7.8084094931346|
|[1954.0,108.0,142...|  8.1|  7.8084094931346|
|[1959.0,136.0,300...|  8.3|8.280351514153912|
|[1964.0,95.0,4529...|  8.4| 8.35262431547133|
+--------------------+-----+-----------------+
only showing top 5 rows

In [67]:
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {:4.3f}'.format(r2_score))
Root Mean Squared Error (RMSE) on test data = 0.487461
r2_score: 0.464


Conclusion

When considering that our guiding questions respectively,

  • Which features can be effective in the prediction of ratings?

In the first step, we tried to figure out which features can help to predict ratings. For this aim, we have some features of movies. These are title, region, language, types, primary title, original title, start year, run time, genres, average ratings, number of votes, countries information directly from IMDb. Firstly, we grouped some of them for elimination. We used type as a limiting factor of data set, region, language, and countries are location information so we decided to add latitude, longitude information of countries to use in the analysis. we did not prefer to use title information in our analysis. Additionally, we have actors and directors information, we used this information converting them as several top actors and the number of top directors. So, before beginning to run machine learning models, we have these features: start year, run time, genres, average ratings, number of votes, latitude, longitude, top actors and top directors.

  • How is the relation between ratings and other features?

After choosing features, to examine futures we made exploratory data analysis, and we looked at each feature of behaviour. When we draw scatter plots of them, we realized that it looks like the number of votes and ratings has positive relation. Additionally, we saw the same situation in the correlation matrix, and we guessed that number of votes can be the most important feature for our analysis. Then, We also looked at each of the distributions. Drama has the highest movie genre and the movie numbers roughly increase year by year.

  • How is the behaviour of top and bottom movies' features?

We thought that the top-rated and bottom rated movies can give clue for our analysis, so we decided to look at closely. And we realized that they have quite different behaviours. While the run time of bottom movies shorter than the top, the top movies have a far higher number of votes. Also, generally, top movies are older than the others. Additionally, interestingly while the majority of top movies are drama, the majority of bottom movies are action. And, we also expected that top actors and directors should have mostly in top movies, however, the distribution of top actors and directors are quite similar for top and bottom movies.

  • The performance of our Machine Learning Models:

After investigation of our guiding question, we made models based on these answers, and we found the below results. As you can see, the random forest gives us the best model. And, when we did exploratory data analysis, we realized that the number of votes has a very right-skewed distribution, so lots of movies have very few votes. So, we also tried our model on limiting data set with the higher number of votes, it gave us far more good results in random forest and the other models. Even if when we trained our best model on the whole data set, we saw that the accuracy is better than the other models.

Model Name RMSE $R^2$
Random Forest Regression (with the whole data) 1.05 0.31
Random Forest Regression (with best hyperparameters) 0.89 0.48
Random Forest Regression (with higher number of Votes) 0.37 0.82
Linear Regression (with higher number of Votes) 0.58 0.51
Generalized Linear Regression (with higher number of Votes) 0.61 0.46
Gradient-boosted tree Regression (with higher number of Votes) 0.49 0.46
  • Discussion:

We learned lots of things during doing this project. Deal with such a big data set was the biggest challenge for us. Then, even if our data set source quite popular, the data was very dirty and most of part of our analysis passed with engaging with cleaning. Lastly, even if we tried lots of features and models for our project, still it can be tried more things. For improving our analysis, it can be tried to add new features like text analysis of titles, different genre lists, and also different movie features. And also it can be tried to run other more advanced machine learning models like neural networks.


References

[1] Contents - Learning Apache Spark with Python documentation. (n.d.).
Https://Runawayhorse001.Github.Io/LearningApacheSpark/Index.Html. Retrieved March 27, 2021, from
https://runawayhorse001.github.io/LearningApacheSpark/index.html

[2] MLlib: Main Guide - Spark 3.1.1 Documentation. (n.d.).
Https://Spark.Apache.Org/Docs/Latest/Ml-Guide.Html. Retrieved March 27, 2021, from
https://spark.apache.org/docs/latest/ml-guide.html

[3] Wenig, B., Damji, J. S., Das, T., & Lee, D. (2020). Learning Spark (1st ed., Vol. 1). Van Duuren Media.

[4] M. Marović, M. Mihoković, M. Mikša, S. Pribil and A. Tus, "Automatic movie ratings prediction using machine learning," 2011 Proceedings of the 34th International Convention MIPRO, Opatija, Croatia, 2011, pp. 1640-1645.

In [ ]: